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Hyperspectral  Image  Classification  via 
Kernel  Sparse  Representation 

Yi  Chen,  Nasser  M.  Nasrabadi,  Fellow,  IEEE ,  and  Trac  D.  Tran,  Senior  Member,  IEEE 


Abstract — In  this  paper,  a  novel  nonlinear  technique  for  hy¬ 
perspectral  image  (HSI)  classification  is  proposed.  Our  approach 
relies  on  sparsely  representing  a  test  sample  in  terms  of  all  of  the 
training  samples  in  a  feature  space  induced  by  a  kernel  function. 
For  each  test  pixel  in  the  feature  space,  a  sparse  representation 
vector  is  obtained  by  decomposing  the  test  pixel  over  a  training 
dictionary,  also  in  the  same  feature  space,  by  using  a  kernel-based 
greedy  pursuit  algorithm.  The  recovered  sparse  representation 
vector  is  then  used  directly  to  determine  the  class  label  of  the 
test  pixel.  Projecting  the  samples  into  a  high-dimensional  fea¬ 
ture  space  and  kernelizing  the  sparse  representation  improve  the 
data  separability  between  different  classes,  providing  a  higher 
classification  accuracy  compared  to  the  more  conventional  linear 
sparsity-based  classification  algorithms.  Moreover,  the  spatial  co¬ 
herency  across  neighboring  pixels  is  also  incorporated  through  a 
kernelized  joint  sparsity  model,  where  all  of  the  pixels  within  a 
small  neighborhood  are  jointly  represented  in  the  feature  space 
by  selecting  a  few  common  training  samples.  Kernel  greedy  opti¬ 
mization  algorithms  are  suggested  in  this  paper  to  solve  the  kernel 
versions  of  the  single-pixel  and  multi-pixel  joint  sparsity-based 
recovery  problems.  Experimental  results  on  several  HSIs  show 
that  the  proposed  technique  outperforms  the  linear  sparsity-based 
classification  technique,  as  well  as  the  classical  support  vector 
machines  and  sparse  kernel  logistic  regression  classifiers. 

Index  Terms — Classification,  hyperspectral  imagery,  joint  spar¬ 
sity  model,  kernel  methods,  sparse  representation. 

I.  Introduction 

HYPERSPECTRAL  imaging  sensors  capture  images  in 
hundreds  of  continuous  narrow  spectral  bands,  spanning 
the  visible  to  infrared  spectrum.  Each  pixel  in  a  hyperspectral 
image  (HSI)  is  represented  by  a  vector  whose  entries  corre¬ 
spond  to  various  spectral-band  responses.  Different  materials 
usually  reflect  electromagnetic  energy  differently  at  specific 
wavelengths.  This  enables  discrimination  of  materials  based 
on  their  spectral  characteristics.  One  of  the  most  important 
applications  of  HSI  is  image  classification,  where  pixels  are 
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labeled  to  one  of  the  classes  based  on  their  spectral  char¬ 
acteristics,  given  a  small  set  of  training  data  for  each  class. 
Various  techniques  have  been  developed  for  HSI  classification. 
Among  the  previous  approaches,  the  support  vector  machine 
(SVM)  [1],  [2]  has  proven  to  be  a  powerful  tool  to  solve 
many  supervised  classification  problems  and  has  shown  good 
performances  in  hyperspectral  classification,  as  well  [3]— [5]. 
Variations  of  SVM-based  algorithms  have  also  been  proposed 
to  improve  the  classification  accuracy.  These  variations  include 
transductive  SVM,  which  exploits  both  labeled  and  unlabeled 
samples  [6],  and  SVM  with  composite  kernels  (CKs),  which 
incorporates  spatial  information  directly  in  the  SVM  kernels 
[7].  Multinomial  logistic  regression  [8]  is  another  widely  used 
classifier,  which  uses  the  logistic  function  to  provide  the  pos¬ 
terior  probability.  A  fast  algorithm  for  sparse  multinomial 
logistic  regression  has  been  developed  in  [9]  and  successfully 
adopted  for  HSI  segmentation  in  [10],  [11].  Some  of  the  other 
recent  HSI  classification  techniques  can  be  found  in  [12]— [17]. 
In  these  recent  methods,  a  feature  extraction  strategy  is  pro¬ 
posed  in  [12]  for  classification  which  generalizes  the  linear  dis¬ 
criminative  analysis  and  nonparametric  discriminative  analysis. 
In  [13],  the  derivative  information  of  the  spectral  signatures 
is  exploited  as  features,  and  then  decisions  obtained  from 
spectral  reflectance  and  derivative  information  are  fused  for 
the  final  decisions.  In  [14],  each  image  band  is  decomposed 
into  intrinsic  mode  functions  (IMFs)  which  are  adaptive  to 
local  properties  via  empirical  mode  decomposition  and  then 
SVM  is  applied  to  the  lower  order  IMFs  for  classification.  In 
[15],  the  ^-nearest-neighbor  classifier  is  applied  to  the  local 
manifolds  to  exploit  the  intrinsic  nonlinear  structure  of  HSIs. 
A  semi- supervised  classification  algorithm  is  proposed  in  [16] 
in  order  to  use  a  kernel  machine  which  is  iteratively  updated  by 
manifold  regularization.  In  [17]  the  results  from  multiple  clas¬ 
sification/segmentation  techniques  are  fused  by  postprocessing 
to  generate  the  final  spectral-spatial  classification  map.  Most 
of  the  aforementioned  HSI  image  classification  techniques  do 
not  directly  incorporate  the  spatial  or  the  contextual  information 
into  the  classifier. 

Recently,  sparse  representation  [18],  [19]  has  also  been  pro¬ 
posed  to  solve  many  computer  vision  tasks  [20]-[25],  where 
the  usage  of  sparsity  as  a  prior  often  leads  to  state-of-the-art 
performance.  Sparse  representation  has  also  been  applied  to 
HSI  target  detection  and  classification  [26]-[28],  relying  on 
the  observation  that  hyperspectral  pixels  belonging  to  the  same 
class  approximately  lie  in  the  same  low-dimensional  subspace. 
Thus,  an  unknown  test  pixel  can  be  sparsely  represented  by 
a  few  training  samples  (atoms)  from  a  given  dictionary,  and 
the  corresponding  sparse  representation  vector  will  implicitly 
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encode  the  class  information.  The  sparse  representation-based 
classifier  is  different  from  the  conventional  sparse  classifier 
SVM  in  the  following  aspects.  SVM  is  a  discriminative  model, 
while  the  sparse  representation  method  can  be  viewed  as  a 
generative  model,  where  the  signal  (pixel)  is  expressed  as  a 
linear  combination  of  atoms  [19].  SVM  is  a  binary  classi¬ 
fier  that  finds  the  separating  hyperplane  between  two  classes 
(multi-class  SVM  requires  a  one-against-one  or  one-against- 
all  strategy).  The  sparse  representation-based  classifier  is  from 
a  reconstruction  point  of  view.  The  sparse  decomposition  of 
the  test  pixel  over  the  entire  dictionary  implicitly  leads  to 
a  competition  between  the  subspaces  (classes),  and  thus  the 
recovered  sparse  representation  is  discriminative.  Moreover,  in 
SVM,  there  is  an  explicit  training  stage.  The  SVM  classifier 
is  trained  only  once,  and  then  this  classifier  with  its  fixed 
sparse  support  vectors  is  used  to  classify  all  of  the  test  data. 
On  the  other  hand,  in  our  proposed  approach,  a  new  sparse 
representation  vector  is  extracted  for  each  test  pixel  and  is  thus 
adaptive,  representing  the  sparsely  selected  atoms  which  are 
adapted  to  reconstruct  the  current  test  pixel. 

HSIs  are  usually  smooth  in  the  sense  that  the  pixels  in  a  small 
neighborhood  represent  the  same  material  and  have  similar 
spectral  characteristics.  Various  techniques  have  been  proposed 
recently  to  exploit  the  contextual  correlation  within  HSI  which 
have  notably  improved  the  classification  and  segmentation  per¬ 
formance.  Post-processing  procedures  are  used  in  [29],  [30] 
on  the  individually  labeled  samples  based  on  certain  decision 
rules  to  impose  the  smoothness.  Markov  random  fields  exploit 
the  statistical  dependency  among  neighboring  pixels  and  are 
usually  applied  in  Bayesian  approaches  [11].  The  CK  approach 
[7]  is  another  way  to  incorporate  the  spatial  information,  which 
explicitly  extracts  spatial  information  for  each  spectral  pixel 
and  then  combines  the  spectral  and  spatial  information  via  ker¬ 
nel  composition.  Joint  sparsity  model  (JSM)  [31]  is  exploited  in 
sparsity -based  HSI  target  detection  and  classification  [27],  [28], 
where  the  neighboring  pixels  are  simultaneously  represented  by 
a  sparse  linear  combination  of  a  few  common  training  sam¬ 
ples.  Each  pixel,  although  sharing  the  same  common  support, 
might  have  weighting  coefficients  taking  on  different  values. 
In  this  way,  the  smoothness  across  neighboring  spectral  pixels 
is  enforced  directly  in  the  classification  stage,  and  no  post¬ 
processing  steps  are  performed.  The  details  of  CKs  and  the  JSM 
will  be  further  discussed  in  the  following  sections. 

It  is  well-known  that  for  the  classical  HSI  image  classifica¬ 
tion  and  target  detection  algorithms,  the  use  of  kernel  methods 
yields  a  significant  performance  improvement  [5],  [32],  because 
the  kernel-based  algorithms  implicitly  exploit  the  higher  order 
structure  of  the  given  data  which  may  not  be  captured  by  the 
linear  models.  Therefore,  if  the  data  set  is  not  linearly  separable, 
kernel  methods  [33]-[36]  can  be  applied  to  project  the  data 
into  a  nonlinear  feature  space  in  which  the  data  becomes  more 
separable.  In  practical  implementation,  the  kernel  trick  [37]  is 
often  used  in  order  to  avoid  explicitly  evaluating  the  data  in  the 
feature  space. 

In  this  paper,  we  propose  a  new  HSI  classification  algorithm 
based  on  kernel  sparse  representation  by  assuming  that  a  test 
pixel  can  be  linearly  represented  by  a  few  training  samples 
in  the  feature  space.  The  kernel  sparse  representation  vector 


is  then  obtained  by  decomposing  the  test  pixel  represented  in 
a  high-dimensional  feature  space  over  a  structured  dictionary 
consisting  of  training  samples  from  all  of  the  classes  in  the  same 
feature  space.  The  recovered  sparse  vector  is  used  directly  for 
classification.  Although  the  proposed  approach  has  a  similar 
formulation  as  previous  kernel  regression  approaches  with  a 
sparse  prior  such  as  kernel  matching  pursuit  [33],  kernel  basis 
pursuit  [34],  and  generalized  LASSO  [38],  the  underlying  ideas 
are  quite  different.  The  objective  of  these  previous  approaches 
is  to  approximate  a  function  as  a  linear  combination  of  dictio¬ 
nary  functions,  which  are  the  kernels  centered  at  the  training 
points,  by  minimizing  certain  loss  function  evaluated  at  these 
training  points  and  subject  to  a  sparsity  prior.  Therefore,  the 
target  vector  for  fitting  consists  of  the  observations  of  the 
function  value  at  the  training  points,  and  the  dictionary  is  then 
the  dictionary  functions  evaluated  at  the  training  points  which 
turns  out  to  be  the  kernel  matrix.  In  our  proposed  approach, 
the  target  vector  is  the  test  pixel  itself  in  the  feature  space. 
It  is  not  the  similarity  measure  between  the  test  sample  and 
training  samples  and  may  not  have  an  explicit  expression.  The 
dictionary  also  consists  of  the  training  samples  in  the  feature 
space  and  cannot  assume  an  explicit  expression  either.  The 
recovered  sparse  representation  vector  can  be  viewed  as  a 
discriminative  feature  extracted  from  the  test  pixel  and  is  used 
directly  for  classification. 

The  contextual  correlation  between  pixels  within  a  small 
spatial  neighborhood  can  be  incorporated  into  the  kernel  sparse 
representation  through  the  JSM  [31],  where  all  neighboring 
pixels  are  simultaneously  represented  by  a  linear  combination 
of  a  few  common  training  samples  in  the  feature  space.  Further¬ 
more,  the  CK  approach  [7]  can  also  be  used  with  the  proposed 
kernel  sparse  representation  model  in  order  to  combine  spectral 
and  spatial  information.  Efficient  kernel-based  optimization 
algorithms  are  discussed  in  this  paper  for  the  recovery  of  the 
kernel  sparse  representations  for  both  single-pixel  and  multi¬ 
pixel  JSMs. 

Notation-wise,  vectors  and  matrices  are  denoted  by  lower 
and  upper  case  bold  letters,  respectively.  For  a  vector  a  E  RN 
and  an  index  set  A  C  {1, ... ,  TV}  with  |A|  =  t,  a  a  E  is  the 
portion  of  a  indexed  on  A.  For  a  matrix  S  E  index 

sets  Ai  C  {1, . . . ,  Ni}  with  |Ai|  =  and  A2  C  {1, . . . ,  7V2} 
with  | A2 1  =  t2,  Sai,:  E  MtlxAr2  is  a  submatrix  of  S  consisting 
of  the  t\  rows  in  S  indexed  on  Ai,  *S':?a2  E  MiVlXt2  consists 
of  the  t2  columns  in  S  indexed  on  A2,  and  *S'a1?a2  E  RtlXt2  is 
formed  by  the  rows  and  columns  of  S  indexed  on  Ai  and  A2, 
respectively. 

The  remainder  of  this  paper  is  structured  as  follows. 
Section  II  briefly  introduces  the  sparsity-based  HSI  classifica¬ 
tion  technique.  Section  III  defines  the  sparsity  models  in  the 
feature  space,  then  discusses  how  to  incorporate  spatial  infor¬ 
mation,  and  describes  the  kernel  sparse  recovery  algorithms. 
Experimental  results  are  shown  in  Section  IV,  and  conclusions 
are  drawn  in  Section  V. 

II.  Sparsity-Based  HSI  Classification 

This  section  briefly  introduces  the  sparsity-based  algorithm 
for  HSI  classification,  and  more  details  can  be  found  in 
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[26]— [28].  It  is  assumed  that  the  spectral  signatures  of  pixels 
belonging  to  the  same  class  approximately  lie  in  the  same  low¬ 
dimensional  subspace.  Thus,  an  unknown  test  sample  x  G  RB , 
where  B  is  the  number  of  spectral  bands,  can  be  written  as  a 
sparse  linear  combination  of  all  of  the  training  pixels  as 


X  =  Aol  (1) 

where  A  =  [a\  . . .  a  at]  G  RBxN  is  a  structured  dictionary 

whose  columns  {a^} -=1  2  N  are  N  training  samples  (referred 
to  as  atoms)  from  all  classes,  and  ol  G  RN  is  an  unknown  sparse 
vector.  The  index  set  on  which  ol  have  nonzero  entries  is  the 
support  of  ol.  The  number  of  nonzero  entries  in  ol  is  called 
the  sparsity  level  if  of  a  and  denoted  by  K  =  ||a||o.  Given 
the  dictionary  A,  the  sparse  coefficient  vector  ol  is  obtained  by 
solving 

ex  =  arg  min  \\x  —  Act  \ | 2  subject  to  || o: ||o  <  K0  (2) 

where  Ko  is  a  preset  upper  bound  on  the  sparsity  level.  The 
problem  in  (2)  is  NP-hard,  which  can  be  approximately  solved 
by  greedy  algorithms,  such  as  orthogonal  matching  pursuit 
(OMP)  [39]  or  subspace  pursuit  (SP)  [40].  The  class  label  of 
x  is  determined  by  the  minimal  residual  between  x  and  its 
approximation  from  each  class  subdictionary: 


Class(cc)  =  arg  min  \\x  -  A:o  do  |L  (3) 


where  Qm  C  {1,2,...,  N}  is  the  index  set  associated  with  the 
training  samples  belonging  to  the  rath  class.  As  pointed  out  in 
[25],  the  sparse  representation-based  classifier  can  be  viewed  as 
a  generalization  of  the  nearest  neighbor  classifier  [41]. 

In  HSI,  pixels  within  a  small  neighborhood  usually  consist 
of  similar  materials  and,  thus,  their  spectral  characteristics  are 
highly  correlated.  The  spatial  correlation  between  neighboring 
pixels  can  be  incorporated  through  a  JSM  [27],  [31]  by  assum¬ 
ing  the  underlying  sparse  vectors  associated  with  these  pixels 
share  a  common  sparsity  pattern  as  follows.  Let  {xt}t=1  T 
be  T  pixels  in  a  spatial  neighborhood  centered  at  X\.  These 
pixels  can  be  compactly  represented  as 

X  =  \x±  X2  ...  xt]  =  [Aol  1  A0L2  •  •  •  Aolt] 

=  A  [oli  ol 2  . . .  cxt]  =  AS .  (4) 

s 

In  the  JSM,  the  sparse  vectors  {ctt}t=1  T  share  the  same 
support  A,  and,  thus,  S  is  a  sparse  matrix  with  only  |  A|  nonzero 
rows.  The  row- sparse  matrix  S  can  be  recovered  by  solving  the 
following  optimization  problem 


S  =  arg  min  \  \X  —  AS  \  \  f  subjectto  ||S||rOw,0  <  (5) 

where  ||  aS'  ||  row,o  denotes  the  number  of  non-zero  rows  of  S , 
and  ||  •  ||  p  denotes  the  Frobenius  norm.  The  problem  in  (5)  can 
be  approximately  solved  by  the  simultaneous  versions  of  OMP 
(SOMP)  [31]  or  SP  (SSP)  [28].  The  label  of  the  center  pixel  xx 
is  then  determined  by  the  minimal  total  residual 


Class  (a3i)  =  arg  min 


X  -  A 


'•  1 


(6) 


where  ||  •  ||  p  denotes  the  Frobenius  norm. 


III.  Kernel  Sparse  Representation 

If  the  classes  in  the  dataset  are  not  linearly  separable,  then 
the  kernel  methods  can  be  used  to  project  the  data  into  a 
feature  space,  in  which  the  classes  become  linearly  separable 
[1].  The  kernel  function  k,  :  RB  x  \->  M  is  defined  as  the 
inner  product 


ft(xi ,  Xj )  —  (jj){xi) ,  .  (7) 

Commonly  used  kernels  include  the  radial  basis  function  (RBF) 
kernel  K(xi,Xj)  =  exp(— 7|| Xi  —  Xj\\2)  with  7  >  0  control¬ 
ling  the  width  of  the  RBF,  and  order-d  homogeneous  and  in¬ 
homogeneous  polynomial  kernels  K,(xi,Xj)  =  (xi  •  Xj)d  and 
K,(xi,Xj)  =  (xi  •  Xj  +  l)d ,  respectively.  In  this  section,  we 
describe  how  the  sparsity  models  in  Section  II  can  be  extended 
to  a  feature  space  induced  by  a  kernel  function. 

A.  Pixel-Wise  Sparsity  in  Feature  Space 

Let  x  G  be  the  data  point  of  interest  and  (j){x)  be  its 
representation  in  the  feature  space.  The  kernel  sparse  repre¬ 
sentation  of  a  sample  x  in  terms  of  training  atoms  a^s  can  be 
formulated  as 

=  [0(ai)  . . .  <p(aN)]  [ai  . . .  a'N]T  =  A^ol  (8) 

' - v - /v - V - ' 

A 0  ex' 

where  the  columns  of  A $  are  the  representations  of  training 
samples  in  the  feature  space  and  a.'  is  assumed  to  be  a  sparse 
vector. 

Similar  to  the  linear  sparse  recovery  problem  in  (2),  ol'  can 
be  recovered  by  solving 

ol  =  arg  min  ||0(cc)  —  A0o/||2  subjectto  ||  oc/  ||o  ^  K0. 

(9) 

The  problem  in  (9)  can  be  approximately  solved  by  kernelizing 
the  OMP  and  SP  algorithms  (denoted  by  KOMP  and  KSP, 
respectively).  Note  that  in  the  above  problem  formulation,  we 
are  solving  for  the  sparse  vector  ol  directly  in  the  feature  space 
using  the  implicit  feature  vectors,  but  not  evaluating  the  kernel 
functions  at  the  training  points. 

In  KOMP  and  KSP,  essentially,  each  dot  product  operation 
in  OMP/SP  is  replaced  by  the  kernel  trick  in  (7).  Let  K a  G 
kernei  matrix  whose  (i,j) th  entry  is  ft(a^,  aj), 
and  kA,x  £  M N  be  the  vector  whose  ith  entry  is  K{a^x). 
Using  the  feature  representations,  the  correlation  (dot  product) 
between  a  pixel  (j)(x)  and  a  dictionary  atom  0(a^)  is  then 
computed  by 

Ci  =  (0(«),  0(a i))  =  k(x,  =  ( kAix)i  (10) 

the  orthogonal  projection  coefficient  of  (j)(x)  onto  a  set  of 
selected  dictionary  atoms  {<t>{an)}neA  is  given  as 


Pa  =  ((ka)  a, a)  (kA,x)A 


(11) 
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and  the  residual  vector  between  cj)(x)  and  its  approximation 
using  the  selected  atoms  {</>(an)}n€ A  =  ( A ^).  A  is  then  ex¬ 
pressed  as 


B.  Joint  Sparsity  in  Feature  Space 

The  JSM  in  (4)  can  also  be  extended  to  the  feature  space  as 
follows: 


<Kr)  =  <Kx)  -  0U):  A  ((Ka)AjA) _1  (fcA,«)A.  (12) 

Note  that  the  feature  representation  of  the  residual  vector  <fi(r) 
in  (12)  cannot  be  evaluated  explicitly.  However,  the  correlation 
between  </>(r)  and  an  atom  (j)(a,i)  can  be  computed  by 

Ci  =  {(j){r),(j){ai)) 

=  (kA,x)i  ~  (KA)it A  ((Ka)AjA)"1  (kA,m) A.  03) 


The  KOMP  and  KSP  greedy  algorithms,  similar  to  the  lin¬ 
ear  OMP  and  SP  algorithms,  are  used  to  locate  the  support 
A  of  the  sparse  vector  a'.  The  KOMP  algorithm  augments 
the  support  set  by  only  one  index,  which  is  given  by  A  = 
arg  maxi=i? Q  with  q  being  defined  in  (13)  and  </>(r)  being 
the  residual  vector  from  the  previous  iteration,  at  each  iteration 
until  Kq  atoms  are  selected  or  the  approximation  error  (i.e., 
norm  of  the  residual  vector  in  (12))  is  within  a  preset  threshold. 
The  KSP  algorithm  maintains  a  set  of  Kq  indices  with  a 
backtracking  mechanism.  At  each  iteration,  the  index  set  is 
refined  by  adding  K0  new  candidates,  whose  associated  atoms 
have  the  Kq  highest  correlation  (13)  to  the  residual  vector  from 
the  previous  iteration,  to  the  current  list  and  then  discarding 
Ko  insignificant  ones  from  the  list  of  2Kq  candidates.  This 
process  repeats  until  certain  stopping  criterion  is  met.  In  both 
of  the  KOMP  and  KSP  algorithms,  after  the  support  set  A  of 
a.'  is  determined,  the  entries  of  6l  indexed  on  A  are  computed 
by  the  orthogonal  projection  of  the  test  pixel  onto  the  selected 
dictionary  atoms  using  (11).  The  KOMP/KSP  algorithms  can 
be  viewed  as  special  cases,  with  T  —  1,  of  the  kernelized 
SOMP/SSP  (KSOMP/KSSP)  algorithms  (Algorithms  1  and  2) 
proposed  in  the  next  section,  respectively.  The  details  are  thus 
omitted  herein. 

Once  the  sparse  vector  a  is  recovered,  the  residual  between 
the  test  sample  and  the  rath-class  reconstruction  in  the  high¬ 
dimensional  feature  space  is  then  computed  by 


r  m  (a?) 


0(x)  (Afi) 

($(X)  ~  5  ^(x)  ‘  _ 

^k(x,x)  —  2otQm 

+&£jKA)ami  nrad'nm)*  (14) 


X(f,=  [<p(xi)  . . .  (xT)j  =  [A^a'-L  . . . 

=  A^  [a'j  . . .  a!T\  =  A^S'  (16) 

v - v - ' 

S' 

where  the  vectors  {ct't}t=1  T  share  the  same  support.  The 
row-sparse  matrix  S'  is  recovered  by  solving 

S  =  arg  min  1 1 X  $  -  S'  \  \  F  subj  ect  to  1 1  S'  \  \  row,  o  <  Ko  • 

(17) 

In  this  paper,  we  propose  the  KSOMP  and  the  KSSP  algorithms 
in  order  to  approximately  solve  the  above  joint  sparse  recovery 
problem  in  (17). 

In  KSOMP,  at  every  iteration,  the  atom  that  simultaneously 
yields  the  best  approximation  to  all  the  T  pixels  (or  residuals 
after  initialization)  is  selected.  Specifically,  let  C  G  RNxT  be 
the  correlation  matrix  whose  (i,  j) th  entry  is  the  correlation 
between  c/)(ai)  and  0(rj),  where  4>(rj)  is  the  residual  vector 
of  (j)(xj ).  The  new  atom  is  then  selected  as  the  one  associated 
with  the  row  of  C ,  which  has  the  maximal  ^p-norm  for  some 
p  >  1.  The  KSOMP  algorithm  is  summarized  in  Algorithm  1. 
Note  that  when  computing  the  projection  in  (1 1)  and  correlation 
in  (13),  a  regularization  term  XI  is  added  in  order  to  stabilize 
the  matrix  inversion,  where  A  is  typically  a  small  scalar  (e.g., 
A  =  10-5  is  implemented  in  Section  IV)  and  I  is  an  identity 
matrix  whose  dimensionality  should  be  clear  from  the  context. 
The  parameter  A  does  not  seriously  affect  the  classification  per¬ 
formance.  It  is,  however,  necessary  for  stable  matrix  inversion. 
Although  the  kernel  matrix  K a  is  usually  invertible,  (K a) a  a 
may  be  nearly  singular  for  certain  small  index  set  A. 


Algorithm  1:  KSOMP 

Input:  B  x  N  dictionary  A  =  [a\  . . .  ojv],  B  x  T  data  matrix 
X  =  [x i  ...  xt\,  kernel  function  k,  and  a  stopping 
criterion 

Initialization:  compute  the  kernel  matrices  K a  €=  RNxN 
whose  (i,j) th  entry  is  n(xi,Xj)  and  K a,x  G  RNxT 
whose  (i,j) th  entry  is  K>(ai,Xj).  Set  index  set  Ao  = 
arg  max  \\(Ka  x)„-  . II  with  some  p  >  1  and  iteration 

i=l,...,AT  ’ 

counter  t  =  1. 

while  stopping  criterion  has  not  been  met  do 

1)  Compute  the  correlation  matrix  C  G  MiVxT 

C  =  KA,x-(KA):At  i  ((KA)At_i)At_i+A/)_1 

x  (KA,x)At_u. 


where  kA,x  and  Ka  are  as  defined  above,  and  flm  is  the 
index  set  associated  with  the  mth  class.  The  class  label  of  x 
is  determined  as 

Class(cc)  =  arg  min  rm(x).  (15) 


2)  Select  the  new  index  as  Xt  =  arg  max  ||C^?:||  , 

p  >  1 

3)  Update  the  index  set  At  =  At-i  Ui^t} 

4)  t  i —  t  - hi 

end  while 
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Output:  Index  set  A  =  At_i,  the  sparse  representation  S 
whose  nonzero  rows  indexed  by  A  are  SA  .  =  (K a  a  + 
A  I)-\KA,X)A. 


Similarly,  KSSP  is  a  simultaneous  version  of  KSP  where 
the  Kq  atoms  that  best  simultaneously  approximate  all  of  the 
T  residuals  in  terms  of  the  /?p-norm  are  chosen.  The  KSSP 
algorithm  is  summarized  in  Algorithm  2.  Note  that  the  step 
for  computing  the  residual  vectors  (12)  is  incorporated  into 
the  computation  of  the  correlation  vector  in  Step  (1)  of  both 
KSOMP  and  KSSP 


Algorithm  2:  KSSP 

Input:  B  x  N  dictionary  A  =  \a\  ...  ax],  B  xT  data  matrix 
X  =  [x i  ...  xt\,  kernel  function  k,  and  a  stopping 
criterion 

Initialization:  compute  the  kernel  matrices  K A  G  RNxN 
whose  (i,j) th  entry  is  n(xi,Xj)  and  KA,x  G  M7VxT 
whose  (T,j)th  entry  is  n(ai,Xj).  Set  index  set 
A0  =  {Kq  indices  corresponding  to  the  Kq  largest  numbers 
in  \\(KA,x)i .  ||  >  1,  i  =  1, . . . ,  TV},  and  set  iteration 

counter  t  =  1. 

while  stopping  criterion  has  not  been  met  do 

1)  Compute  the  correlation  matrix 

C=KA,X-(KA):M_1  ((KA)At_iiAt_i+Aj)"1 

2)  Find  the  index  set  X  =  {Kq  indices  corresponding 
to  the  Kq  largest  numbers  in  ||C^: \\  ,p  >  1,  i  = 
1,...,7V} 

3)  Update  the  candidate  index  set  At  =  At_i  (J  X 

4)  Compute  the  projection  coefficients  P  = 

((KA)AtjAt  +  A I)-\KA,x)Kt.  €  M2K°xT 

5)  Update  the  index  set  At  =  {Kq  indices  in  At 
corresponding  to  the  Kq  largest  numbers  in  ||P^:||  , 
p>l,i  =  l,,..,N} 

6)  t  i —  t  H-  1 

end  while 

Output:  Index  set  A  =  A*_i,  the  sparse  representation  S 
whose  nonzero  rows  indexed  by  A  are  SA  .  =  (K a  a  + 
A  I)-\KA,X)A. 


Once  the  matrix  S  is  recovered,  the  total  residual  between 
the  T  neighboring  pixels  and  their  approximations  from  the 
rath-class  training  samples  is  computed  by 

rm(x l)  =  (  (^{xiixi)  —  A,x)£imii 

\  1=1 

^  (18) 


where  KA,x  and  K  A  are  as  defined  in  Algorithms  1  and  2, 
and  G  {1,2,...,  TV}  is  the  index  set  associated  with  the 
rath  class.  The  label  for  the  center  pixel  X\  is  then  determined 
by  the  total  residual 

Class  (#1)  =  arg  min  rm(xi).  (19) 


C.  Kernel  Sparse  Representation  With  a  Composite  Kernel 

Another  way  to  address  the  contextual  correlation  within  HSI 
is  through  a  CK  [7],  which  takes  into  account  the  spatial  corre¬ 
lation  between  neighboring  pixels  by  combining  kernels  dedi¬ 
cated  to  the  spectral  and  spatial  information.  The  CK  approach 
has  been  shown  to  significantly  outperform  the  spectral-only 
classifier  in  HSI  classification  [42].  This  method,  although  orig¬ 
inally  proposed  for  SVM,  can  be  readily  incorporated  into  other 
classifiers  which  operate  in  the  feature  space,  such  as  kernel 
logistic  regression  (KLR)  and  the  kernel  sparse  representation- 
based  classifier  proposed  in  this  paper.  Specifically,  let  x™  be 
the  spectral  pixel  at  location  i  in  a  HSI  and  x\  be  the  spatial 
information  extracted  from  a  small  neighborhood  centered  at 
location  i,  which  is  usually  the  mean  and/or  the  standard  devia¬ 
tion  of  the  pixels  within  the  neighborhood.  The  new  pixel  entity 
at  this  location  can  be  redefined  as  Xi  =  {xf,  xf  }.  Note  that  in 
previous  sections,  Xi  contains  only  spectral  information  (i.e., 
Xi  =  xf).  The  spectral  and  spatial  information  can  then  be 
combined  in  a  variety  of  ways,  including  stacking,  direct  sum¬ 
mation,  weighted  summation,  and  cross-information  kernels 
[7].  In  this  paper,  we  consider  the  weighted  summation  kernel, 
which  is  shown  to  yield  the  best  classification  performance 
compared  to  other  types  of  CKs  [7] .  The  kernel  function  in  this 
case  is 

n(Xi,Xj)  =  IIKS  ( X*,Xj )  +  (1  -  n)nw  (x™,X™)  (20) 

where  p  G  (0, 1),  and  ns  and  nw  are  the  kernel  functions  of  the 
spatial  and  spectral  features,  respectively. 

The  CKs  can  be  directly  applied  to  the  pixel-wise  sparsity 
model  in  the  feature  space  in  (8).  The  sparse  representation 
vector  can  be  recovered  using  the  KOMP  or  KSP  algorithm, 
where  the  kernel  matrix  K a  is  now  a  weighted  summation 
of  the  spectral  and  spatial  kernel  matrices  of  the  training 
dictionary  A ,  and  the  vector  kA,x  also  needs  to  be  modified 
accordingly. 

It  is  worth  noting  that  the  CK  approach  is  different  from  the 
kernel  JSM  discussed  in  Section  III-B  in  terms  of  incorporating 
localized  spatial  features.  The  JSM  involves  only  the  spatial 
information  of  the  test  pixels,  and  no  prior  knowledge  about  the 
neighbors  of  the  training  pixels  is  needed.  On  the  other  hand, 
for  the  CKs,  the  local  spatial  information  for  both  training  and 
test  sets  are  necessary.  Moreover,  the  JSM  does  not  assume  a 
sum  or  average  of  the  same  samples,  but  treats  all  pixels  in 
a  small  neighborhood  equally  and  finds  the  sparsity  pattern 
that  simultaneously  represents  these  pixels.  The  CK  approach, 
however,  is  not  limited  to  local  spatial  features  and  can  also  be 
used  to  combine  other  types  of  features  (e.g.,  centroid  of  the 
cluster  as  a  global  feature)  and  information  sources. 
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TABLE  I 

The  16  Ground-Truth  Classes  in  the  AVIRIS  Indian  Pines  Image 


Class 

Samples 

No 

Name 

Train 

Test 

1 

Alfalfa 

6 

48 

2 

Corn-notill 

144 

1290 

3 

Corn-min 

84 

750 

4 

Corn 

24 

210 

5 

Grass/Pasture 

50 

447 

6 

Grass/Trees 

75 

672 

7 

Grass/Pasture-mowed 

3 

23 

8 

Hay-windrowed 

49 

440 

9 

Oats 

2 

18 

10 

Soybeans-notill 

97 

871 

11 

Soybeans-min 

247 

2221 

12 

Soybean-clean 

62 

552 

13 

Wheat 

22 

190 

14 

Woods 

130 

1164 

15 

Building-Grass-Trees-Drives 

38 

342 

16 

Stone-steel  Towers 

10 

85 

Total 

1043 

9323 

IV.  Experimental  Results 

In  this  section,  we  show  the  effectiveness  of  the  proposed 
algorithms  on  classification  of  several  hyperspectral  data  sets. 
For  each  image,  we  solve  the  sparse  recovery  problems  in  (2), 
(5),  (9),  and  (17)  for  each  test  sample,  and  then  determine 
the  class  by  the  minimal  residual  (the  results  are  denoted  by 
OMP/SP,  KOMP/KSP,  SOMP/SSP,  and  KSOMP/KSSP,  respec¬ 
tively).  The  results  of  KOMP  and  KSP  with  CKs,  as  discussed 
in  Section  III-C,  are  denoted  by  KOMPCK  and  KSPCK,  respec¬ 
tively.  The  classification  results  are  then  compared  visually  and 
quantitatively  to  those  obtained  by  the  classical  SVM  classifier 
and  sparse  multinomial  KLR.  For  SVM  and  KFR  classifiers, 
we  use  a  spectral-only  kernel  (denoted  by  SVM/KFR),  as  well 
as  a  CK  (denoted  by  SVMCK/KFRCK).  In  all  classifiers  with 
a  CK,  we  use  a  weighted  summation  kernel,  and  the  spatial 
information  is  the  mean  of  pixels  in  a  small  neighborhood.  The 
parameters  for  KFR,  KFRCK,  SVM,  and  SVMCK  are  obtained 
by  cross-validation. 

The  first  HSI  in  our  experiments  is  the  Airborne  Vis¬ 
ible/Infrared  Imaging  Spectrometer  (AVIRIS)  image  Indian 
Pines  [43].  The  AVIRIS  sensor  generates  220  bands  across 
the  spectral  range  from  0.2  to  2.4  fim.  In  the  experiments, 
the  number  of  bands  is  reduced  to  200  by  removing  20  water 
absorption  bands.  This  image  has  spatial  resolution  of  20  m  per 
pixel  and  spatial  dimension  145  x  145.  It  contains  16  ground- 
truth  classes.  For  each  class,  we  randomly  choose  around  10% 
of  the  labeled  samples  for  training  and  use  the  remaining  90% 
for  testing,  as  shown  in  Table  I  and  Fig.  1.  RBF  kernels  are 
used  in  all  kernel-based  classifiers  (i.e.,  SVM,  SVMCK,  KFR, 
KFRCK,  KOMP,  KSP,  KSOMP,  KSSP,  KOMPCK,  and 
KSPCK).  Since  this  image  consists  of  large  homogenous 
regions,  a  large  spatial  window  of  size  9  x  9  (T  =  81)  is  used 
in  classifiers  with  a  CK  and  the  JSMs  (4)  and  (16). 

The  classification  performance  for  each  of  the  16  classes, 
overall  accuracy  (OA),  average  accuracy  (AA),  and  the  k 
coefficient  measure  [44]  on  the  test  set  are  shown  in  Table  II. 
The  OA  is  the  ratio  between  correctly  classified  test  samples 
and  the  total  number  of  test  samples,  the  AA  is  the  mean  of  the 
16  class  accuracies,  and  the  k  coefficient  is  a  robust  measure 


(a) 

■  Alfalfa 
|  Corn-notill 
Corn-min 
|  Corn 

Grass/Pasture 
|  Grass/T  rees 
|  Grass/Pasture-mowed 
Hay-windrowed 

Fig.  1.  (a)  Training  and  (b)  test  sets  for  the  Indian  Pines  image. 

of  the  degree  of  agreement.  The  classification  maps  on  labeled 
pixels  are  presented  in  Fig.  2,  where  the  algorithm  and  OA 
are  shown  on  top  of  each  corresponding  map.  One  can  clearly 
see  that  incorporating  the  contextual  correlation  and  operat¬ 
ing  in  the  feature  space  both  have  significantly  improved  the 
classification  accuracy.  The  KOMPCK  and  KSPCK  algorithms 
outperform  all  other  classifiers — the  OA  for  both  of  which  are 
greater  than  98%.  The  KSOMP  and  KSSP  algorithms  also  yield 
superior  performance,  which  have  only  about  1%  lower  OA 
than  KOMPCK  and  KSPCK.  Note  that  the  kernel  JSM  for 
KSOMP  and  KSSP  does  not  assume  any  prior  knowledge  of 
the  neighbors  of  the  training  samples  as  the  CK  approach  does. 

The  sparsity  level  Kq  and  RBF  parameter  7  used  in  the 
above  experiments  are  obtained  from  a  small  validation  set.  An 
n-fold  cross  validation  would  not  be  appropriate  for  finding 
the  optimal  sparsity  level,  unless  n  is  large  (e.g.,  leave-one- 
out  cross  validation).  This  is  because  the  sparsity  level  Kq 
is  related  to  the  size  of  dictionary,  therefore,  the  optimal  Ko 
for  part  of  the  dictionary  may  not  be  optimal  anymore  for  the 
entire  dictionary.  Now,  we  examine  how  these  two  parameters 
affect  the  classification  performance  on  the  Indian  Pines  image. 
We  use  randomly  selected  10%  of  all  labeled  samples  as  the 
training  set  and  the  remaining  samples  as  the  test  set,  then  vary 
K0  from  5  to  80  and  7  from  2“3  to  212  in  KOMP,  KSP,  KSOMP, 
and  KSSP.  The  experiment  for  each  7,  K0,  and  each  of  the 
four  algorithm  is  repeated  five  times  using  different  randomly 
chosen  training  sets  to  avoid  any  bias  induced  by  random 
sampling.  The  window  size  is  fixed  at  9  x  9  for  KSOMP  and 
KSSP  due  to  its  smoothness.  The  OA  on  the  test  set,  averaged 
over  five  independent  realizations,  are  shown  in  Fig.  3.  The  bars 
indicate  the  maximal  and  minimal  accuracies  in  five  runs  at 
each  point,  and  we  see  that  the  the  fluctuation  is  usually  within 
2%  and  within  1%  in  a  majority  of  cases.  One  can  observe  from 
Fig.  3(a)  and  (b)  that  for  the  pixel- wise  kernel  sparsity  model, 


(b) 

I  Oats 
|  Soybeans-notill 
|  Soybeans-min 
|  Soybean-clean 
■  Wheat 
|  Woods 

Building-Grass-T  rees 
1H  Stone-steel  Towers 
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TABLE  II 

Classification  Accuracy  (%)  for  the  Indian  Pines  Image  Using  10%  Training  Samples  as  Shown  in  Fig.  1  and  Table  I 


Class 

SVM 

SVMCK 

KLR 

KLRCK 

OMP 

KOMP 

SOMP 

KSOMP 

KOMPCK 

SP 

KSP 

SSP 

KSSP 

KSPCK 

1 

81.25 

95.83 

64.58 

75.00 

68.75 

72.92 

85.42 

97.92 

97.92 

68.75 

72.92 

81.25 

91.67 

95.83 

2 

86.28 

96.67 

89.46 

96.43 

65.97 

86.36 

94.88 

97.21 

99.22 

74.65 

87.91 

95.74 

97.98 

99.15 

3 

72.80 

90.93 

70.67 

95.47 

60.67 

77.47 

94.93 

96.67 

96.93 

63.20 

78.53 

92.80 

97.73 

96.93 

4 

58.10 

85.71 

67.14 

86.19 

38.57 

62.86 

91.43 

93.33 

95.24 

40.00 

62.86 

82.38 

96.67 

97.14 

5 

92.39 

93.74 

90.60 

96.42 

89.49 

90.38 

89.49 

95.75 

98.43 

89.04 

90.60 

93.29 

94.85 

98.21 

6 

96.88 

97.32 

98.07 

98.66 

95.24 

97.17 

98.51 

99.55 

99.11 

95.98 

96.88 

98.81 

98.96 

99.11 

7 

43.48 

69.57 

17.39 

82.61 

21.74 

21.74 

91.30 

60.87 

100 

21.74 

21.74 

82.61 

17.39 

100 

8 

98.86 

98.41 

98.86 

97.95 

97.05 

98.18 

99.55 

100 

100 

99.09 

98.64 

99.77 

100 

99.97 

9 

50 

55.56 

16.67 

50 

33.33 

55.56 

0 

0 

88.89 

61.11 

55.56 

0 

0 

100 

10 

71.53 

93.80 

74.97 

93.80 

68.20 

77.61 

89.44 

94.60 

98.05 

70.72 

79.33 

91.27 

94.37 

97.70 

11 

84.38 

94.37 

84.87 

95.54 

75.96 

85.68 

97.34 

99.28 

97.43 

77.94 

86.90 

97.43 

98.33 

98.20 

12 

85.51 

93.66 

81.16 

91.85 

54.53 

77.90 

88.22 

95.65 

98.73 

61.23 

78.44 

89.13 

97.46 

98.73 

13 

100 

99.47 

100 

100 

100 

100 

100 

100 

100 

100 

100 

99.47 

100 

100 

14 

93.30 

99.14 

95.02 

96.56 

92.87 

95.70 

99.14 

99.83 

99.40 

95.62 

95.96 

99.05 

99.91 

99.48 

15 

64.91 

87.43 

61.70 

88.01 

41.23 

55.85 

99.12 

91.81 

97.95 

48.25 

55.56 

97.95 

97.08 

97.37 

16 

88.24 

100 

57.65 

88.24 

94.12 

92.94 

96.47 

91.76 

97.65 

92.94 

94.12 

92.94 

94.12 

95.29 

OA 

84.52 

94.86 

84.78 

95.10 

74.78 

85.26 

95.28 

97.33 

98.33 

78.10 

86.09 

95.34 

97.46 

98.47 

AA 

79.24 

90.73 

73.05 

89.55 

68.61 

78.02 

88.45 

88.39 

97.81 

72.52 

78.50 

87.12 

86.03 

98.31 

K 

0.823 

0.941 

0.826 

0.944 

0.712 

0.832 

0.946 

0.970 

0.981 

0.749 

0.841 

0.947 

0.971 

0.983 

Ground  Truth 

et 

fir 


OMP,  OA  =  74.78 

‘^Sii 


i§£ 


wear 


SP,  OA  =  78.10 


1BN 

m 


SVM,  OA  =  84.52 

*gp 


SVMCK,  OA  =  94.86 


IfiSF  *&n?-  *fiff 
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KOMP,  OA  =  85.26 

*8jj|h 
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fir 
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9 


KOMPCK,  OA  =  98.33 
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Fig.  2.  Classification  maps  and  overall  classification  accuracy  (OA)  for  the  Indian  Pines  image  on  labeled  pixels  with  10%  training  samples. 


7  =  512  leads  to  the  highest  OA  at  all  sparsity  levels.  For  a 
fixed  7,  the  performance  of  KOMP  and  KSP  generally  improves 
as  Kq  increases  and  tends  to  saturate  as  K0  reaches  30-50.  For 
KSOMP  and  KSSP,  as  shown  in  Fig.  3(c)  and  (d),  the  same 
tendency  cannot  be  observed.  However,  the  kernel  JSM  is  more 
stable  than  the  pixel- wise  model,  as  for  a  large  range  of  sparsity 
level  Kq  and  sufficiently  large  7,  the  OA  is  always  around  96% 
with  a  small  variance.  The  stable  performance  suggests  that  we 
could  also  use  empirical  parameters  Kq  and  7. 


The  next  two  HSIs  used  in  our  experiments,  the  University 
of  Pavia  and  the  Center  of  Pavia  images,  are  urban  images 
acquired  by  the  Reflective  Optics  System  Imaging  Spectrom¬ 
eter  (ROSIS).  The  ROSIS  sensor  generates  115  spectral  bands 
ranging  from  0.43  to  0.86  (im  and  has  a  spatial  resolution  of 
1.3  m  per  pixel  [42].  The  University  of  Pavia  image  consists 
of  610  x  340  pixels,  each  having  103  bands,  with  the  12  most 
noisy  bands  removed.  There  are  nine  ground-truth  classes  of 
interests,  as  shown  in  Table  III.  For  this  image,  we  follow  the 
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Fig.  3.  Effect  of  sparsity  level  Kq  and  RBF  kernel  parameter  7  on  the  Indian  Pines  image  using  (a)  KOMP,  (b)  KSP,  (c)  KSOMP,  and  (d)  KSSP. 


TABLE  III 

The  Nine  Ground-Truth  Classes  in  the 
University  of  Pavia  Image 


same  experiment  settings  for  the  training  and  test  sets  as  used 
in  [30],  [42],  in  which  about  9%  of  labeled  data  are  used  as 
training,  and  the  rest  are  used  for  testing,  as  shown  in  Table  III 
and  Fig.  4. 

The  classification  accuracies  and  the  ft  coefficients  on  the 
test  set  using  various  techniques  are  shown  in  Table  IV,  and 


Class 

Samples 

No 

Name 

Train 

Test 

1 

Asphalt 

548 

6304 

2 

Meadows 

540 

18146 

3 

Gravel 

392 

1815 

4 

Trees 

524 

2912 

5 

Metal  sheets 

265 

1113 

6 

Bare  soil 

532 

4572 

7 

Bitumen 

375 

981 

8 

Bricks 

514 

3364 

9 

Shadows 

231 

795 

Total 

3921 

40002 

■  Asphalt 
Meadows 
Gravel 

|  T rees 

■  Metal  sheets 

■  Bare  soil 

■  Bitumen 

■  Bricks 
Shadows 

(a)  (b) 

Fig.  4.  (a)  Training  and  (b)  test  sets  for  the  University  of  Pavia  image. 

the  classification  maps  for  all  labeled  pixels  are  presented  in 
Fig.  5.  Again,  the  RBF  kernel  is  used  for  all  kernel-based  algo¬ 
rithms.  This  urban  image  lacks  the  large  spatial  homogeneity. 
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TABLE  IV 

Classification  Accuracy  (%)  for  the  University  of  Pavia  Image  Using  3921 
(Around  9%)  Training  Samples  as  Shown  in  Fig.  4  and  Table  III 


Class 

SVM 

SVMCK 

KLR 

KLRCK 

OMP 

KOMP 

SOMP 

KSOMP 

KOMPCK 

SP 

KSP 

SSP 

KSSP 

KSPCK 

1 

84.30 

79.85 

82.96 

74.40 

68.23 

76.09 

59.33 

94.23 

82.23 

69.78 

76.67 

69.59 

89.56 

89.64 

2 

67.01 

84.86 

83.34 

85.91 

67.04 

69.61 

78.15 

76.74 

72.47 

67.90 

70.92 

72.31 

79.98 

72.68 

3 

68.43 

81.87 

64.13 

61.71 

65.45 

72.12 

83.53 

79.23 

82.26 

69.20 

73.39 

74.10 

85.45 

80.06 

4 

97.80 

96.36 

96.33 

96.22 

97.29 

98.11 

96.91 

95.12 

98.56 

96.77 

98.15 

95.33 

98.66 

98.94 

5 

99.37 

99.37 

99.19 

99.10 

99.73 

99.73 

99.46 

100 

99.82 

99.64 

99.82 

99.73 

99.91 

100 

6 

92.45 

93.55 

80.05 

84.45 

73.27 

87.66 

77.41 

99.50 

93.92 

78.96 

89.70 

86.72 

95.76 

94.77 

7 

89.91 

90.21 

84.51 

85.32 

87.26 

88.07 

98.57 

99.80 

92.46 

88.18 

88.28 

90.32 

97.96 

89.81 

8 

92.42 

92.81 

83.17 

93.37 

81.87 

89.51 

89.09 

98.78 

78.78 

83.68 

87.54 

90.46 

96.43 

89.54 

9 

97.23 

95.35 

89.81 

96.48 

95.97 

93.96 

91.95 

29.06 

96.98 

94.59 

95.22 

90.94 

98.49 

96.48 

OA 

79.15 

87.18 

83.56 

84.77 

73.30 

78.33 

79.00 

85.67 

81.07 

74.86 

79.18 

78.39 

87.65 

83.19 

AA 

87.66 

90.47 

84.83 

86.33 

81.79 

86.10 

86.04 

85.83 

88.61 

83.19 

86.63 

85.50 

93.58 

90.21 

K 

0.737 

0.833 

0.784 

0.799 

0.661 

0.725 

0.728 

0.815 

0.758 

0.681 

0.735 

0.724 

0.840 

0.785 

SP,  OA  =  74.86 


SVM,  OA  =  79.15 


SVMCK,  OA  =  87.18 


KLR,  OA  =  83.56 


KSOMP,  OA  =  85.67 


KSSP,  OA  =  87.65 


KLRCK,  OA  =  84.77 


Fig.  5.  Classification  maps  and  overall  classification  accuracy  (OA)  for  the  University  of  Pavia  image  using  around  9%  labeled  samples  as  training  set. 
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Fig.  7.  Relative  performance  gain  by  kernelization  for  sparsity-based  classi- 
Fig.  6.  Effect  of  dictionary  size  for  (a)  the  Indian  Pines  image  and  (b)  the  fiers  0n  (a)  the  Indian  Pines  image  and  (b)  the  University  of  Pavia  image  with 
University  of  Pavia  image.  different  dictionary  size. 


Therefore,  a  smaller  neighborhood  of  size  5  x  5  is  optimal 
for  algorithms  using  a  CK,  and  the  linear  and  kernel  JSMs. 
Similar  to  the  Indian  Pines  image,  the  proposed  KSOMP/KSSP 
algorithms  achieve  better  or  comparable  performance  when 
compared  with  the  SVMCK  classifier  for  most  of  the  classes. 
KSOMP  yields  the  best  accuracy  in  five  out  of  the  total  nine 
classes,  and  KSSP  has  the  highest  OA,  AA,  and  ft  coefficient. 
The  overall  performance  of  SVM,  KOMP,  and  KSP,  which 
are  kernel  methods  for  pixel-wise  models,  are  comparable, 
and  by  incorporating  the  contextual  information,  the  SVMCK, 
KSOMP,  and  KSSP  techniques  still  have  comparable  perfor¬ 
mance.  The  sparsity-based  algorithms  generally  do  not  handle 
the  second  class,  representing  Meadows,  very  well.  For  exam¬ 
ple,  the  accuracy  for  the  second  class  for  KSOMP  and  KSSP 
is  5%-9%  lower  than  that  for  SVMCK  and  KLRCK,  which 
affect  the  OA  because  this  class  contains  more  than  45%  of  the 
samples  in  the  entire  test  set.  This  could  be  circumvented  by 
selecting  or  learning  a  more  representative  training  set  which  is 
sufficiently  comprehensive  to  span  the  class  subspace. 


In  the  sequel,  we  examine  how  the  number  of  training 
samples  affects  the  classification  performance  for  various  algo¬ 
rithms  on  the  Indian  Pines  and  the  University  of  Pavia  images. 
The  algorithm  parameters  are  fixed  to  be  the  same  as  those 
used  to  generate  the  results  in  Tables  II  and  IV.  For  the  Indian 
Pines  image,  in  each  test,  we  randomly  choose  1%  to  30%  of 
the  labeled  data  in  each  class  as  the  training  samples  and  the 
remaining  samples  as  the  test  ones.  The  classification  accuracy 
plots  under  various  conditions  are  shown  in  Fig.  6(a)  for  the 
Indian  Pines  image,  where  the  x-axis  denotes  the  percentage 
of  training  samples  from  the  total  available  labeled  samples, 
and  the  y- axis  is  the  OA  on  the  test  set.  The  accuracies  are 
averaged  over  five  runs  for  each  classifier  at  each  percentage 
level  to  avoid  any  bias  induced  by  random  sampling,  and  the 
bars  indicate  the  maximal  and  minimal  accuracies  for  each 
point  in  the  five  runs.  The  OA  monotonically  increases  as  the 
size  of  training  set  increases,  and  the  variance  is  small  (the 
difference  between  the  maximum  and  minimum  is  within  1%) 
when  at  least  5%-10%  training  samples  become  available.  The 
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Fig.  8.  Relative  performance  gain  by  contextualization  for  various  classifiers 
on  (a)  the  Indian  Pines  image  and  (b)  the  University  of  Pavia  image  with 
different  dictionary  size. 

KOMPCK  and  KSPCK  consistently  yield  higher  OA  than  any 
other  classifiers. 

For  the  University  of  Pavia  image,  we  create  a  balanced 
dictionary  by  randomly  choosing  L  =  10,  20,  30,  50,  100,  and 
200  training  samples  per  class,  and  these  training  samples  are 
a  subset  of  the  entire  training  set  shown  in  Fig.  4(a).  Since 
the  dictionary  is  considerably  small,  the  sparsity  level  K0  is 
set  to  be  no  more  than  L.  The  classification  accuracy  plots 
are  shown  in  Fig.  6(b),  where  the  x-axis  denotes  the  number 
of  training  samples  per  class,  and  the  y- axis  is  the  overall 
classification  accuracy  on  the  test  set.  Again,  the  accuracies  are 
averaged  over  five  runs  for  each  classifier  at  each,  L  and  the 
bars  represent  the  maximum  and  minimum  in  the  five  runs.  It 
is  obvious  that  in  most  cases,  the  OA  increases  monotonically, 
and  the  variance  decreases  as  the  number  of  training  samples 
increases.  For  the  University  of  Pavia  image,  the  performance  at 
L  =  50  is  almost  the  same  as  that  at  L  =  100  for  all  classifiers. 
The  SVMCK  classifier  consistently  outperforms  all  of  the  other 
classifiers  when  the  number  of  training  samples  is  small,  but 


TABLE  V 

The  Nine  Ground-Truth  Classes  in  the  Center  of 
Pavia  Image  and  the  Training  and  Test  Sets 


Class 

Samples 

No 

Name 

Train 

Test 

1 

Water 

745 

64533 

2 

Trees 

785 

5722 

3 

Meadow 

797 

2094 

4 

Brick 

485 

1667 

5 

Soil 

820 

5729 

6 

Asphalt 

678 

6847 

7 

Bitumen 

808 

6479 

8 

Tile 

223 

2899 

9 

Shadow 

195 

1970 

Total 

5536 

97940 

|  Water 
|  T rees 
Meadow 
|  Brick 

■  Soil 

|  Asphalt 
|  Bitumen 

■  Tile 
Shadow 


Fig.  9.  (a)  Training  and  (b)  test  sets  for  the  Center  of  Pavia  image. 


the  curves  for  SVMCK  and  KSSP  tend  to  converge  as  more 
training  samples  are  available  (see  Table  IV  for  the  performance 
comparison  of  SVMCK  and  KSSP  with  a  large  training  set).  It 
should  also  be  pointed  out  again  that  during  the  training  stage 
of  algorithms  using  a  CK,  in  order  to  extract  the  local  spatial 
features  for  each  training  sample,  one  requires  knowledge  of 
the  neighboring  pixels  or  the  location  of  the  training  sample, 
which  may  not  be  available  in  the  training  set.  Moreover,  the 
proposed  sparsity-based  algorithms  rely  on  the  approximation 
accuracy  from  each  class  subdictionary.  Therefore,  if  the  size  of 
the  subdictionary  is  too  small,  the  training  samples  may  not  be 
sufficient  to  faithfully  represent  the  subspace  associated  with 
each  class,  leading  to  a  lower  classification  accuracy  than  the 
discriminative  classifier  SVM. 

A  closer  inspection  of  the  performance  gain,  as  a  function  of 
the  dictionary  size,  obtained  by  kernelization  and  contextual¬ 
ization  is  shown  in  Figs.  7  and  8,  respectively.  The  y- axis  rep¬ 
resents  the  relative  gain  in  percentage  (averaged  over  five  runs), 
which  is  the  ratio  between  the  improvement  in  accuracy  and 
the  OA  of  the  algorithm  before  kernelization/contextualization. 
For  example,  in  the  case  of  contextualization  of  KOMP  using 
the  JSM,  the  relative  gain  is  computed  by 


9  = 


OAksomp  —  OAkomp 


*  100% 


OAkomp 
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TABLE  VI 

Classification  Accuracy  (%)  for  the  Center  of  Pavia  Image  Using  5536  Training  Samples 
(Around  5%  of  All  Labeled  Samples  as  Shown  in  Fig.  9  and  Table  V) 


Class 

SVM 

SVMCK 

KLR 

KLRCK 

OMP 

KOMP 

SOMP 

KSOMP 

KOMPCK 

SP 

KSP 

SSP 

KSSP 

KSPCK 

1 

99.19 

97.46 

99.63 

100 

98.91 

98.13 

99.32 

99.07 

98.98 

98.20 

98.09 

97.79 

99.26 

98.79 

2 

77.74 

93.08 

93.18 

95.39 

86.75 

92.76 

92.38 

95.30 

96.31 

86.98 

91.17 

92.82 

91.23 

91.70 

3 

86.74 

97.09 

96.18 

95.89 

96.04 

97.04 

95.46 

97.09 

96.08 

96.61 

97.28 

97.80 

97.71 

99.57 

4 

40.38 

77.02 

81.76 

89.80 

81.22 

88.84 

85.66 

89.68 

97.78 

84.16 

86.86 

78.52 

95.26 

94.54 

5 

97.52 

98.39 

96.25 

98.59 

94.40 

94.89 

96.37 

97.56 

97.82 

94.01 

95.76 

95.81 

97.45 

94.99 

6 

94.77 

94.32 

93.91 

96.67 

91.94 

96.13 

92.83 

98.31 

96.54 

92.92 

95.82 

96.52 

97.41 

93.92 

7 

74.37 

97.50 

95.22 

97.31 

93.18 

95.40 

94.68 

98.80 

98.63 

93.80 

95.57 

95.96 

97.82 

96.90 

8 

98.94 

99.83 

99.52 

98.41 

98.62 

99.34 

99.69 

99.93 

100 

98.79 

99.24 

99.79 

99.90 

99.55 

9 

100 

99.95 

99.90 

99.49 

98.07 

99.39 

98.68 

100 

96.65 

99.34 

99.39 

98.83 

71.42 

93.60 

OA 

94.63 

96.81 

97.99 

98.92 

96.68 

97.19 

97.66 

98.53 

98.46 

96.40 

97.08 

96.93 

97.82 

97.55 

AA 

85.52 

94.96 

95.06 

96.84 

93.24 

95.77 

95.01 

97.30 

97.64 

93.87 

95.47 

94.87 

94.16 

95.95 

K 

0.899 

0.943 

0.963 

0.980 

0.940 

0.949 

0.958 

0.973 

0.972 

0.935 

0.947 

0.945 

0.960 

0.956 

where  OAKsomp  and  OAKomp  are  the  OA  for  the  KSOMP 
and  KOMP  algorithms,  respectively.  The  relative  gain  obtained 
by  the  kernelization  of  the  SP,  SSP,  OMP,  and  SOMP  algorithms 
is  shown  in  Fig.  7(a)  and  (b)  for  the  Indian  Pines  and  University 
of  Pavia  images,  respectively.  One  can  observe  that  in  most 
cases,  kernelization  consistently  leads  to  a  performance  gain  of 
5%  to  20%.  The  only  exception  exists  in  the  KSSP  and  KSOMP 
algorithms  for  the  Indian  Pines  image  with  a  higher  percentage 
of  training  samples,  which  is  partly  due  to  the  fact  that  SSP 
and  SOMP  before  kernelization  already  achieve  an  OA  of  at 
least  95%.  In  this  case,  an  improvement  of  2%  to  3%  means 
the  error  rate  is  reduced  by  half  which  could  be  considered 
significant. 

The  relative  gain  obtained  by  incorporation  of  spatial  infor¬ 
mation  in  OMP,  KOMP,  SVM,  and  KLR  is  shown  in  Fig.  8(a) 
and  (b)  for  the  Indian  Pines  image  and  the  University  of  Pavia 
image,  respectively.  The  contextualization  of  SP  and  KSP  has  a 
similar  effect  to  that  of  OMP  and  KOMP,  and  the  results  are  not 
reported  here.  Note  that  with  the  kernel  sparse  representation 
model,  the  contextual  correlation  can  be  incorporated  through 
either  a  JSM  or  a  CK,  and  thus  the  relative  gain  of  KSOMP 
(through  JSM)  and  KOMPCK  (through  CK)  over  KOMP  are 
both  shown  in  Fig.  8.  One  can  observe  that  for  the  India  Pines 
image,  the  linear  method  OMP  is  the  most  sensitive  to  the 
spatial  information,  in  which  the  relative  gain  is  generally  more 
than  20%.  The  other  classifiers  all  work  in  the  feature  space, 
and  the  gain  ranges  from  10%  to  15%  in  most  cases,  with  a 
slight  decrease  as  the  number  of  training  samples  increases.  For 
the  University  of  Pavia  image,  the  relative  gain  in  classification 
accuracy  is  usually  around  10%  to  14%.  Contrary  to  the  case 
of  the  Indian  Pines  image,  the  improvement  of  the  linear  ap¬ 
proach  OMP  is  slightly  less  than  the  kernel  methods.  Moreover, 
the  performance  of  KLR  is  not  as  consistent  as  the  other 
methods. 

The  third  image  in  our  experiments,  Center  of  Pavia,  is  the 
other  urban  image  collected  by  the  ROSIS  sensor  over  the  cen¬ 
ter  of  the  Pavia  city.  This  image  consists  of  1096  x  492  pixels, 
each  having  102  spectral  bands  after  13  noisy  bands  are 
removed.  The  nine  ground-truth  classes  and  the  number  of 
training  and  test  samples  for  each  class  are  shown  in  Table  V 
and  illustrated  in  Fig.  9.  For  this  image,  about  5%  of  the  labeled 
data  are  used  as  training  samples.  The  classification  results 
are  summarized  in  Table  VI,  and  the  classification  maps  are 
shown  in  Fig.  10.  KLRCK  achieves  a  100%  accuracy  on  the 


first  class  (water),  which  occupies  66%  of  the  test  set,  and  thus 
yields  the  best  OA.  The  KSOMP  and  KSSP  work  very  well 
on  the  other  classes,  except  that  KSSP  fails  for  the  ninth  class 
(Shadow). 

In  general,  one  can  observe  from  the  experimental  results  on 
these  three  images  that  the  incorporation  of  contextual  infor¬ 
mation  improves  the  classification  performance  (e.g.,  SP  versus 
SSP,  KSP  versus  KSSP,  SVM  versus  SVMCK,  etc).  Moreover, 
operating  in  the  kernel  feature  space  also  significantly  improves 
the  accuracy  (e.g.,  SP  versus  KSP,  SSP  versus  KSSP,  etc). 

V.  Conclusion 

In  this  paper,  we  propose  a  new  HSI  classification  tech¬ 
nique  based  on  sparse  representations  in  a  nonlinear  feature 
space  induced  by  a  kernel  function.  The  spatial  correlation 
between  neighboring  pixels  is  incorporated  through  a  JSM. 
Experimental  results  on  AVIRIS  and  ROSIS  HSIs  show  that 
the  kernelization  of  the  sparsity-based  algorithms  improves 
the  classification  performance  compared  to  the  linear  version. 
It  is  also  shown  that  the  proposed  algorithm  has  a  better  or 
comparable  performance  to  the  recent  spectral-spatial  single¬ 
classifiers  such  as  SVMCK. 

The  proposed  sparsity-based  classifier  is  different  from  the 
conventional  sparse  classifier  SVM  in  many  aspects.  SVM  is 
a  discriminative  model,  which  finds  the  separating  hyperplane 
between  two  classes.  A  model  with  a  fixed  set  of  sparse  support 
vectors  is  obtained  by  a  training  process  over  the  whole  training 
set,  and  then  this  SVM  is  used  to  classify  all  of  the  test  data. 
However,  our  method  can  be  considered  as  a  generative  model. 
The  subspaces  representing  different  classes  implicitly  compete 
with  each  other  during  the  sparse  recovery  process,  leading  to  a 
discriminative  representation  vector.  This  sparse  representation 
vector  is  extracted  for  each  test  pixel  and  is  thus  adaptive.  This 
will  inevitably  lead  to  an  increase  in  the  computational  cost, 
but  the  kernel  matrix  K a  can  be  computed  offline.  Therefore, 
the  most  intensive  part  in  the  sparse  recovery  is  the  inversion 
of  a  matrix  of  at  most  size  Kq  x  AT0  for  the  OMP-based 
algorithms  and  (2Kq)  x  (2ATo)  for  the  SP-based  algorithms. 
Moreover,  in  the  OMP-based  algorithms,  since  the  support  set 
is  sequentially  augmented  by  one  index  at  a  time,  the  inversion 
can  be  accelerated  by  Cholesky  decomposition  [45]. 

Our  proposed  dictionary -based  classifier  provides  several 
advantages.  First,  new  training  samples  can  be  easily  added 
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Fig.  10.  Classification  maps  and  overall  classification  accuracy  (OA)  for  the  Center  of  Pavia  image  using  5536  training  samples  (around  5%  of  all  labeled 
samples)  as  shown  in  Fig.  10  and  Table  V. 
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to  the  dictionary  without  retraining  the  model,  unlike  the  other 
classifiers  (e.g.,  SVM  and  KLR)  that  need  to  retrain  the  model 
for  the  new  training  data.  Also,  our  algorithm  is  particularly 
useful  for  creating  a  dictionary  invariant  to  the  environmental 
variations  by  adding  synthetically  generated  spectral  signa¬ 
tures  that  account  for  various  illuminations  and  atmospheric 
conditions  [46].  Moreover,  the  JSM  in  kernel  space  is  still 
applicable  when  the  training  data  is  synthetically  generated  or 
from  a  spectra  library  rather  than  taken  from  the  scene.  On  the 
other  hand,  in  order  to  incorporate  local  spatial  information, 
classifiers  using  composite-kernels  require  knowledge  of  the 
local  spatial  features  of  each  training  data  which  may  not  be 
available,  and  thus  these  classifiers  may  not  be  applicable  in  the 
case  of  unstructured  synthetic  or  library  training  sets. 

The  classification  accuracy  can  be  further  improved  by  a 
post-processing  step,  or  combining  the  proposed  technique  with 
other  state-of-the-art  classifiers  to  generate  a  megaclassifier 
[17].  Another  possible  direction  is  the  design/learning  of  a 
better  dictionary  such  that  the  dictionary  provides  more  accu¬ 
rate  reconstruction,  more  discriminative  power,  and/or  better 
adaptivity  to  the  test  data. 
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